THE SPECTRAL NORM ERROR OF THE NAIVE NYSTROM EXTENSION 



ALEX GITTENS 

Abstract. The naive Nystrom extension forms a low-rank approximation to a positive-semidefinite 
matrix by uniformly randomly sampling from its columns. This paper provides the first relative- 
error bound on the spectral norm error incurred in this process. This bound follows from a natural 
connection between the Nystrom extension and the column subset selection problem. The main 
tool is a matrix Chernoff bound for sampling without replacement. 



1. Introduction 

Nystrom extensions are a class of algorithms that quickly form low-rank approximations to pos- 
itive semidefinite (PSD) matrices by sampling from their columns. We consider the naive Nystrom 
extension, a particular scheme in which the columns are sampled uniformly without replacement. 
By exploiting a natural connection between Nystrom extensions and the column subset selection 
problem, we find the first relative-error spectral norm guarantees for the naive Nystrom extension. 

1.1. Efficacy of the naive Nystrom extension. Perhaps surprisingly, given that one uses no 
information about the matrix itself to make the column selections, the naive Nystrom extension is 
effective in practice. Because of its data agnosticism and empirical accuracy, the naive Nystrom 
extension is a natural choice for any application where one wishes to avoid the cost of examining 
(or even constructing) the entire dataset before approximation. 

The naive Nystrom extension has proven to be particularly useful in image-processing
applications, which typically involve computations with large dense matrices [FBCM04, WDT+09, BF].
In spectral image segmentation, for example, one constructs a matrix of pairwise pixel affinities
by comparing neighborhoods of each pair of pixels. Several leading eigenvectors of this matrix are
then used to segment the image. The affinity matrix of an $N \times N$ image has dimension
$N^2 \times N^2$, so it is challenging to construct and hold the affinity matrix in memory even for
images of a moderate size. Similarly, the density and size of the affinity matrix make it
challenging to compute the leading eigenvectors. [FBCM04] proposes using the naive Nystrom
extension to approximate the eigenvectors of the affinity matrix. Doing so allows one to work with
much larger images, because it is only necessary to compute a fraction of the columns of the
affinity matrix.

1.2. Structure of the Nystrom extension. Let $A$ be a PSD matrix of size $n$. Select $\ell \ll n$
columns of $A$ to constitute the columns of a matrix $C$. Let $W$ be the $\ell \times \ell$ matrix
formed by the intersection of the columns in $C$ and the corresponding rows in $A$. The matrix
$C W^\dagger C^T$ is then a Nystrom extension of $A$ (see Figure 1). Here $(\cdot)^\dagger$ denotes
Moore-Penrose pseudoinversion. Since $W$ is a principal submatrix of $A$, it is
positive-semidefinite, and hence the Nystrom extension is also positive-semidefinite.
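
The construction just described can be sketched in a few lines of NumPy. This is only an illustration: the function name `nystrom_extension`, the test matrix, and the explicit `rcond` cutoff used in the pseudoinverse are assumptions made here, not part of the paper.

```python
import numpy as np

def nystrom_extension(A, idx, rcond=1e-10):
    """Return C W^+ C^T for the columns of the PSD matrix A listed in idx."""
    idx = np.asarray(idx)
    C = A[:, idx]                    # n x l block of selected columns
    W = A[np.ix_(idx, idx)]          # l x l principal submatrix
    # rcond is an illustrative cutoff below which singular values count as zero.
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T

# Example: a rank-3 PSD matrix, 10 columns chosen uniformly without replacement.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 3))
A = B @ B.T
idx = rng.choice(50, size=10, replace=False)
A_nys = nystrom_extension(A, idx)
rel_err = np.linalg.norm(A - A_nys, 2) / np.linalg.norm(A, 2)
```

In this example the matrix has rank 3 and the sampled block generically has the same rank, so `rel_err` should sit at the level of rounding error.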

The manner in which the columns are sampled and $W^\dagger$ is calculated or approximated
determines the type of the Nystrom extension. Various sampling schemes have been proposed, ranging
from the fast and simple naive scheme in which the columns are selected uniformly at random without
replacement to more sophisticated and calculation-intensive schemes that involve sampling from a
distribution determined by the determinants of principal submatrices of $A$ [BW09]. In practice the
naive scheme represents a favorable trade-off between speed and accuracy [KMT09b].

Figure 1. The Nystrom extension procedure

1.3. Nystrom approximation of invariant subspaces. In many applications, including the
image-processing example taken from [FBCM04], the Nystrom extension is used to obtain
approximations to the dominant invariant subspace of a PSD matrix $A$, rather than a low-rank
approximation [HM]. Through the Davis-Kahan $\sin\Theta$ theorem, the spectral norm approximation
error provides information on the quality of the approximate invariant subspace obtained via
Nystrom extensions [Bha97, Section VII.3].

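To be more precise, let $\tilde{A}$ be a Nystrom approximation to $A$ and assume both $A$ and
$\tilde{A}$ have unique dominant $k$-dimensional invariant subspaces. Let $U_k$ and $\tilde{U}_k$
have orthonormal columns and span, respectively, the dominant $k$-dimensional invariant subspace of
$A$ and that of $\tilde{A}$. Recall one natural definition for the distance between the dominant
$k$-dimensional invariant subspaces of $A$ and $\tilde{A}$,

\[ \operatorname{dist}(U_k, \tilde{U}_k) = \| U_k U_k^T - \tilde{U}_k \tilde{U}_k^T \|_2. \]

Denote the $k$th-largest eigenvalue of a matrix $M$ by $\lambda_k(M)$, so that
$\lambda_1(M) \ge \lambda_2(M) \ge \cdots$. It follows from the Davis-Kahan $\sin\Theta$ theorem
that if $\lambda_k(A) - \lambda_{k+1}(\tilde{A}) > 0$, then

\[ \operatorname{dist}(U_k, \tilde{U}_k) \le \frac{\|A - \tilde{A}\|_2}{\lambda_k(A) - \lambda_{k+1}(\tilde{A})}. \tag{1} \]

Assume that we have a relative-error spectral norm bound of the form

\[ \|A - \tilde{A}\|_2 \le C \lambda_j(A) \]

for some $j > k$. Since $\lambda_{k+1}(\tilde{A}) \le \lambda_{k+1}(A) + \|A - \tilde{A}\|_2$ by
Weyl's inequality, equation (1) becomes

\[ \operatorname{dist}(U_k, \tilde{U}_k) \le \frac{C \lambda_j(A)}{\lambda_k(A) - \lambda_{k+1}(A) - C \lambda_j(A)}. \]

Thus, we conclude that if $C \lambda_j(A)$ is sufficiently smaller than the eigengap
$\lambda_k(A) - \lambda_{k+1}(A)$, the Nystrom extension yields a quality approximation to the
dominant $k$-dimensional invariant subspace of $A$.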
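
To make the role of the eigengap concrete, the final bound above, $C\lambda_j(A) / (\lambda_k(A) - \lambda_{k+1}(A) - C\lambda_j(A))$, can be evaluated for trial spectra. The helper name `subspace_distance_bound` and the numerical values below are hypothetical and chosen only for illustration.

```python
def subspace_distance_bound(C, lam_j, lam_k, lam_k_plus_1):
    """Evaluate C*lam_j / (lam_k - lam_{k+1} - C*lam_j); None if the gap is too small."""
    gap = lam_k - lam_k_plus_1 - C * lam_j
    if gap <= 0:
        return None     # the bound carries no information in this regime
    return C * lam_j / gap

# Hypothetical spectrum with a large eigengap, taking j = k + 1 and C = 2.
bound = subspace_distance_bound(C=2.0, lam_j=0.05, lam_k=1.0, lam_k_plus_1=0.05)

# Hypothetical spectrum whose eigengap is too small relative to C * lam_j.
no_info = subspace_distance_bound(C=2.0, lam_j=0.4, lam_k=1.0, lam_k_plus_1=0.4)
```

The second call returns `None` because the denominator is not positive, which is the regime in which the argument gives no guarantee.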

This paper presents a simple framework for the analysis of Nystrom schemes that yields a
state-of-the-art spectral norm error bound in the case of the naive Nystrom extension scheme.
Specifically, it generalizes the coherence-based exact recovery result in [TR10] to also guarantee
small relative error in the case of a matrix with a fast-decaying spectrum. This is the first truly
relative-error spectral norm bound available for any Nystrom extension method. When the eigengap
$\lambda_k(A) - \lambda_{k+1}(A)$ is sufficiently large, our result sanctions the use of the naive
Nystrom extension for the approximation of the dominant $k$-dimensional invariant subspace of $A$.

1.4. Our relative-error spectral norm bound. The efficacy of the naive Nystrom extension is
of course dependent on the data set to which it is applied. Intuitively, the extension should
perform better if the information is spread evenly throughout the columns of the matrix. The
coherence of the invariant subspaces of $A$ provides a quantitative measure of the informativity of
the columns. Let $S$ be a $k$-dimensional subspace of $\mathbb{R}^n$ and let $P_S$ denote the
projection onto $S$. Then the coherence of $S$ is

\[ \mu_0(S) = \frac{n}{k} \max_i (P_S)_{ii}. \]

Corollary 1 is a condensed version of our main result, Theorem 2, and uses the notion of coherence
to provide a bound on the error of the naive Nystrom extension.

Corollary 1. Let $A$ be a real PSD matrix of size $n$. Given an integer $k \le n$, let $\tau$
denote the coherence of a dominant $k$-dimensional invariant subspace of $A$. Fix a nonzero failure
probability $\delta$. If $\ell \ge 8 \tau k \log(k/\delta)$ columns of $A$ are chosen uniformly at
random without replacement, then

\[ \|A - C W^\dagger C^T\|_2 \le \lambda_{k+1}(A) \left(1 + \frac{2n}{\ell}\right) \]

with probability exceeding $1 - \delta$.

For Corollary 1 to provide a meaningful estimate of $\ell$, the required number of column samples,
the coherence $\tau$ must be small enough that $\ell \ll n$. This requirement reflects our
intuition that approximations formed using a small number of columns will not be accurate if a
small number of columns are significantly more influential than the others. In the best-case
scenario of $\tau = 1$, the columns are equally informative and we find that the error of a naive
Nystrom extension formed using just $C k \log k$ columns is close to that of the optimal rank-$k$
approximation.
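
As a rough numerical reading of Corollary 1, the sketch below evaluates the sample-size requirement $\ell \ge 8\tau k \log(k/\delta)$ and the resulting factor $1 + 2n/\ell$. The function names and the parameter values are hypothetical, and the logarithm is assumed to be the natural logarithm.

```python
import math

def corollary1_sample_size(tau, k, delta):
    """Smallest integer l with l >= 8 * tau * k * log(k / delta)."""
    return math.ceil(8.0 * tau * k * math.log(k / delta))

def corollary1_error_factor(n, ell):
    """Factor multiplying lambda_{k+1}(A) in Corollary 1: 1 + 2n / l."""
    return 1.0 + 2.0 * n / ell

n, k, delta = 10_000, 10, 0.1
ell = corollary1_sample_size(tau=1.0, k=k, delta=delta)   # best case tau = 1
factor = corollary1_error_factor(n, ell)
```

With these values `ell` is 369, a small fraction of the 10,000 columns, which is the regime in which the corollary is informative.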

1.5. Relevant literature. We briefly review the literature on Nystrom extensions, focusing on
the naive Nystrom scheme. In this section $A$ is a PSD matrix, $A_k$ is a rank-$k$ approximation to
$A$ that is optimal in the spectral norm, and $\ell$ is the number of columns used to construct a
Nystrom extension of $A$.

Williams and Seeger introduce the Nystrom extension in [WS01], based upon a similar method
used in numerical integral equation solvers, as a heuristic method for efficiently approximating
the eigendecomposition of kernel matrices. In this seminal work, only an empirical analysis of
the approximation error is offered. Drineas and Mahoney provide the first rigorous analysis of a
Nystrom extension in [DM05]: in the scheme they consider, columns are sampled with probability
proportional to the square of the diagonal entries of $A$. In addition to probabilistic schemes,
many adaptive sampling schemes have been proposed. These attempt to progressively choose the
columns to decrease the approximation error. For an introduction to this body of literature, we
refer the interested reader to the discussion in [FGK11].

Kumar, Mohri, and Talwalkar attempt the first analysis of the naive Nystrom extension in
[KMT09b], resulting in bounds for the Frobenius norm error. Their analysis proceeds by bounding
the expectation and variance of the error, then applying a concentration of measure argument. A
simplified yet representative statement of their bound is that

\[ \|A - C W^\dagger C^T\|_F \le \|A - A_k\|_F + \epsilon n \cdot \max_i (A)_{ii} \]

with constant probability when $\ell \ge C k / \epsilon^4$.

In [KMT09a], Kumar, Mohri, and Talwalkar establish that if $\operatorname{rank}(W) =
\operatorname{rank}(A) = r$, then $A = C W^\dagger C^T$. Talwalkar and Rostamizadeh prove this
implies that, if $\ell \ge C r \mu \log(r/\delta)$, naive Nystrom extension results in exact
recovery with constant probability [TR10]. Here $\mu$ is a measure of the coherence of the column
space of $A$ that differs slightly from the definition used in this paper. Their key observation is
that if no columns of $A$ are singularly influential, then $W$ will have maximal rank when $\ell$
is slightly larger than the rank of $A$. Thus, the number of samples required for exact recovery is
determined by the rank of $A$ and the coherence, $\mu$, of its range space. They use a standard
result from the compressed sensing literature to quantify this phenomenon and obtain an estimate
for $\ell$ [CR07].

In [LKL10], Kwok, Li, and Lu propose replacing $W$ with a low-rank approximation $\tilde{W}$ to
facilitate the pseudoinversion operation. This large-scale variant of the naive Nystrom extension
allows a larger number of column samples to be drawn and leads to smaller empirical approximation
errors. The approximation $\tilde{W}$ is constructed using the randomized methodology espoused in
[HMT11]. The analysis of the error combines bounds provided in [HMT11] with a matrix sparsification
argument. In addition to $\ell$ and $k$, the Nystrom algorithm presented in [LKL10] depends on two
additional parameters that control the creation of $\tilde{W}$: an oversampling factor $p$ and the
number $q$ of iterations of the power method. After taking $p = \ell - k$ and $q = 1$, the results
of [LKL10] provide error bounds for the naive Nystrom extension:

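\[ \mathbb{E} \|A - C W^\dagger C^T\|_2 \le C \left( \|A - A_k\|_2 + \frac{n}{\sqrt{\ell}} \max_i (A)_{ii} \right) \tag{2} \]

\[ \mathbb{E} \|A - C W^\dagger C^T\|_F \le C \sqrt{\ell} \left( \|A - A_k\|_F + \frac{n}{\sqrt{\ell}} \max_i (A)_{ii} \right). \]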

Of the works mentioned, only [LKL10] provides a bound on the spectral error of the Nystrom
method for $A$ of arbitrary rank. Unfortunately, the quantity $\max_i (A)_{ii}$ is, for a general
$A \succeq 0$, bounded only by $\lambda_1(A)$. Thus equation (2) does not provide a relative-error
bound. In fact, the spectral norm error bound provided in this paper is always tighter than the
bound provided in [LKL10], for any choice of $p$ and $q$.

Our work presents an intuitive and simple approach to the analysis of Nystrom extensions through 
their connection to the randomized column subset selection problem. This allows us to obtain the 
first truly relative-error guarantee on the spectral norm error. This paper analyzes the naive 
sampling scheme but we believe that the framework given is flexible enough to be fruitfully applied 
to the analysis of other Nystrom extension schemes including, in particular, the large-scale variant 
introduced in [LKL10].

1.6. Outline. In Section 2 we introduce our notation and review some algebraic preliminaries.
In Section 3 we establish a connection between the Nystrom extension procedure and the column
subset selection problem. We exploit this connection and a result from [HMT11] to provide a
general error bound for any Nystrom extension scheme. In Section 4 we specialize this result to
the case of the naive Nystrom extension.



2. Notation 

We work exclusively with real matrices and order the eigenvalues of a PSD matrix $A$ so that
$\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$. Each PSD matrix $A$ has a unique
square root $A^{1/2}$ that is also positive-semidefinite, has the same eigenspaces as $A$, and
satisfies $A = (A^{1/2})^2$.

The projection onto the column space of a matrix $M$ is written $P_M$ and satisfies

\[ P_M = M M^\dagger = M (M^T M)^\dagger M^T. \]

The notation $(x)_j$ refers to the $j$th entry of the vector $x$, and $(M)_i$ refers to the $i$th
column of the matrix $M$. Likewise, $(M)_{ij}$ refers to the $(i,j)$th entry of $M$.

The coherence of a matrix $U \in \mathbb{R}^{n \times k}$ with orthonormal columns is, up to a
scaling factor, the maximum of the squared Euclidean norms of its rows:

\[ \mu_0(U) := \frac{n}{k} \max_i \|(U^T)_i\|_2^2 = \frac{n}{k} \max_i (U U^T)_{ii} = \frac{n}{k} \max_i (P_U)_{ii}. \]

From the last equality, we see that the coherence is in fact an intrinsic property of the subspace
spanned by $U$. Thus we refer to the coherence of a subspace without first choosing a particular
orthonormal basis $U$.
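
The definition can be checked on two small bases, one concentrated on a few coordinates and one spread evenly. This is a minimal sketch: the function name `coherence` and both example matrices are chosen here for illustration.

```python
import numpy as np

def coherence(U):
    """mu_0(U) = (n / k) * max_i ||i-th row of U||^2, for U with orthonormal columns."""
    n, k = U.shape
    row_norms_sq = np.sum(U * U, axis=1)      # diagonal of P_U = U U^T
    return (n / k) * row_norms_sq.max()

n, k = 8, 2
# Spiky basis: two coordinate vectors, so two rows carry all the mass.
U_spiky = np.eye(n)[:, :k]
# Flat basis: every entry has magnitude 1 / sqrt(n).
signs = np.array([1.0, -1.0] * (n // 2))
U_flat = np.column_stack([np.ones(n), signs]) / np.sqrt(n)

mu_spiky = coherence(U_spiky)   # equals n / k for this basis
mu_flat = coherence(U_flat)     # equals 1 for this basis
```

Since the diagonal entries of a rank-$k$ projection lie in $[0, 1]$ and sum to $k$, these two values, $n/k$ and 1, are the extremes the coherence can take.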

3. The connection to the column subset selection problem 

In this section we establish a fruitful connection between the performance of the Nystrom exten- 
sion and the performance of randomized column subset selection. 

Given a matrix $M$, the goal of column selection is to choose a small but informative subset
$C$ of the columns of $M$ so that, after approximating $M$ with the matrix obtained by projecting
$M$ onto the span of $C$, the residual $(I - P_C)M$ is small in some norm. In randomized column
subset selection, the columns $C$ are chosen randomly, either uniformly or according to some
data-dependent distribution. Column subset selection has important applications in statistical
data analysis and has been investigated by both the numerical linear algebra and the theoretical
computer science communities. For an introduction to the column subset selection literature,
biased towards approaches involving randomization, we refer the interested reader to the surveys
[Maha, Mahb].

Our first theorem establishes that the Nystrom extension of $A$ is intimately related to the
randomized column subset selection problem for $A^{1/2}$. We model the column sampling operation
as follows: let $S$ be a random matrix with $\ell$ columns, each of which has exactly one nonzero
element. Then right multiplication by $S$ selects $\ell$ columns from $A$:

\[ C = A S \quad \text{and} \quad W = S^T A S. \]

The distribution of $S$ reflects the type of sampling being performed. In the case of the naive
Nystrom extension, $S$ is distributed as the first $\ell$ columns of a uniformly random permutation
matrix.
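
The sampling model can be made concrete as follows. This is a sketch under the assumption that the single nonzero entry in each column of $S$ equals one; the helper name `sampling_matrix` and the test matrix are illustrative.

```python
import numpy as np

def sampling_matrix(n, ell, rng):
    """First ell columns of a uniformly random n x n permutation matrix."""
    perm = rng.permutation(n)
    S = np.eye(n)[:, perm[:ell]]     # column j is the standard basis vector e_{perm[j]}
    return S, perm[:ell]

rng = np.random.default_rng(1)
n, ell = 6, 3
G = rng.standard_normal((n, n))
A = G @ G.T                          # an arbitrary PSD test matrix
S, idx = sampling_matrix(n, ell, rng)

C = A @ S                            # same as A[:, idx]
W = S.T @ A @ S                      # same as A[np.ix_(idx, idx)]
```

In practice one would index into $A$ directly rather than form $S$; the explicit matrix is useful mainly because it is the object the analysis manipulates.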

We use the following partitioning of the eigenvalue decomposition of $A$ to state our results:

\[ A = \begin{bmatrix} \overset{k}{U_1} & \overset{n-k}{U_2} \end{bmatrix}
       \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix}
       \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix}. \tag{3} \]

The columns of $U_1$ and $U_2$ respectively span a dominant $k$-dimensional invariant subspace of
$A$ and the corresponding bottom $(n-k)$-dimensional invariant subspace of $A$. The interaction of
the column sampling matrix $S$ with the invariant subspaces spanned by $U_1$ and $U_2$ is captured
by the matrices

\[ \Omega_1 = U_1^T S, \qquad \Omega_2 = U_2^T S. \tag{4} \]

Theorem 1. Let $A$ be a PSD matrix of size $n$ and let $S$ be an $n \times \ell$ matrix. Partition
$A$ as in equation (3) and define $\Omega_1$ and $\Omega_2$ as in equation (4).

Assume $\Omega_1$ has full row rank. Then the spectral approximation error of the Nystrom extension
of $A$ using $S$ as the column sampling matrix satisfies

\[ \|A - C W^\dagger C^T\|_2 = \|(I - P_{A^{1/2} S}) A^{1/2}\|_2^2 \le \|\Sigma_2\|_2 \left(1 + \|\Omega_2 \Omega_1^\dagger\|_2^2\right). \tag{5} \]

Prior analyses of Nystrom extensions have used the Cholesky decomposition of $A$. By instead
using the square root, Theorem 1 establishes an equivalence between the column subset selection
problem and the Nystrom extension procedure and gives a deterministic relative-error bound on
the performance of Nystrom extensions.

To establish Theorem 1, we use the following bound on the error incurred by projecting a matrix
onto a random subspace of its range ([HMT11, Theorem 9.1]).

Proposition 1. Let $M$ be a PSD matrix of size $n$. Fix integers $k$ and $\ell$ satisfying
$1 \le k \le \ell \le n$.

Let $U_1$ and $U_2$ be matrices with orthonormal columns spanning, respectively, a dominant
$k$-dimensional invariant subspace of $M$ and the corresponding bottom $(n-k)$-dimensional
invariant subspace of $M$. Let $\Sigma_2$ be the diagonal matrix of eigenvalues corresponding to
the bottom $(n-k)$-dimensional invariant subspace of $M$.

Given a matrix $S$ of size $n \times \ell$, define $\Omega_1 = U_1^T S$ and $\Omega_2 = U_2^T S$.
Then, assuming that $\Omega_1$ has full row rank,

\[ \|(I - P_{MS}) M\|_2^2 \le \|\Sigma_2\|_2^2 + \|\Sigma_2 \Omega_2 \Omega_1^\dagger\|_2^2. \]


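Proof of Theorem 1. We write the Nystrom extension in terms of the square root of $A$ and a
projection onto the space spanned by $A^{1/2} S$:

\begin{align*}
C W^\dagger C^T &= A S (S^T A S)^\dagger S^T A \\
                &= A^{1/2} \left[ A^{1/2} S (S^T A^{1/2} A^{1/2} S)^\dagger S^T A^{1/2} \right] A^{1/2} \\
                &= A^{1/2} P_{A^{1/2} S} A^{1/2}.
\end{align*}

It follows that the spectral error of the Nystrom extension satisfies

\begin{align*}
\|A - C W^\dagger C^T\|_2 &= \|A^{1/2} (I - P_{A^{1/2} S}) A^{1/2}\|_2 = \|A^{1/2} (I - P_{A^{1/2} S})^2 A^{1/2}\|_2 \\
                          &= \|(I - P_{A^{1/2} S}) A^{1/2}\|_2^2.
\end{align*}

The second equality holds because of the idempotency of projections. The third follows from the
fact that $\|B^T B\|_2 = \|B\|_2^2$ for any matrix $B$. Partition $A$ as in equation (3). The
bottom eigenvalues of $A^{1/2}$ are the diagonal entries of $\Sigma_2^{1/2}$, and
$\|\Sigma_2^{1/2} \Omega_2 \Omega_1^\dagger\|_2^2 \le \|\Sigma_2\|_2 \|\Omega_2 \Omega_1^\dagger\|_2^2$,
so equation (5) now follows immediately from Proposition 1 with $M = A^{1/2}$. □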
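
The equality and the inequality in (5) are deterministic, so they can be spot-checked numerically for any particular selection of columns. The sketch below does this for one synthetic matrix; the test spectrum, the sizes, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, ell = 30, 3, 10

# PSD test matrix with known eigenvectors Q and eigenvalues lam (descending).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = 1.0 / np.arange(1, n + 1) ** 2
A = (Q * lam) @ Q.T
A_half = (Q * np.sqrt(lam)) @ Q.T              # PSD square root of A

idx = rng.choice(n, size=ell, replace=False)
S = np.eye(n)[:, idx]
C, W = A @ S, S.T @ A @ S

# Left side of (5): spectral error of the Nystrom extension.
err = np.linalg.norm(A - C @ np.linalg.pinv(W) @ C.T, 2)

# Middle of (5): squared residual of projecting A^{1/2} onto range(A^{1/2} S).
Y, _ = np.linalg.qr(A_half @ S)                # orthonormal basis of that range
resid = np.linalg.norm(A_half - Y @ (Y.T @ A_half), 2) ** 2

# Right side of (5), built from Omega_1, Omega_2 and Sigma_2.
Omega1, Omega2 = Q[:, :k].T @ S, Q[:, k:].T @ S
bound = lam[k] * (1.0 + np.linalg.norm(Omega2 @ np.linalg.pinv(Omega1), 2) ** 2)
```

Up to rounding, `err` and `resid` agree and neither exceeds `bound`.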

4. Error bounds for naive Nystrom extension 

In this section, we provide a bound on the spectral norm approximation error of naive Nystrom 
extensions. For convenience, we recall the following partitioning of the eigenvalue decomposition 
of a PSD matrix $A$ of size $n$:

\[ A = \begin{bmatrix} \overset{k}{U_1} & \overset{n-k}{U_2} \end{bmatrix}
       \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix}
       \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix}, \tag{6} \]

where the columns of $U_1$ and $U_2$ respectively span a dominant $k$-dimensional invariant
subspace of $A$ and the corresponding bottom $(n-k)$-dimensional invariant subspace of $A$. We also
recall the matrices

\[ \Omega_1 = U_1^T S, \qquad \Omega_2 = U_2^T S \tag{7} \]

that capture the interaction of the column sampling operation with the invariant subspaces of $A$.
Theorem 2 establishes that, if the spectrum of $A$ decays sufficiently and an appropriate number of
columns are sampled, then the error incurred by the naive Nystrom extension process is small.

Theorem 2. Let $A$ be a PSD matrix of size $n$. Given an integer $k \le n$, partition $A$ as in
equation (6). Let $\tau$ denote the coherence of $U_1$,

\[ \tau = \mu_0(U_1). \]

Fix a failure probability $\delta \in (0, 1)$. For any $\epsilon \in (0, 1)$, if

\[ \ell \ge \frac{2 \tau k \log(k/\delta)}{(1 - \epsilon)^2} \]

columns of $A$ are chosen uniformly at random and used to form a Nystrom extension, the spectral
norm error of the approximation satisfies

\[ \|A - C W^\dagger C^T\|_2 \le \lambda_{k+1}(A) \left(1 + \frac{n}{\epsilon \ell}\right) \]

with probability exceeding $1 - \delta$.

Remark 1. The coherence of $U_1$ is a measure of how much comparative influence the individual
columns of $A$ have over the dominant $k$-dimensional invariant subspace of $A$ spanned by $U_1$:
if $\tau$ is small, then all columns have essentially the same influence; if $\tau$ is large, then
it is possible that there is a single column in $A$ which alone determines one of the top $k$
eigenvectors of $A$.

For illustrative purposes, we point out that the coherence of a random $n \times k$ matrix with
orthonormal columns, i.e. a matrix distributed uniformly on the Stiefel manifold, is
$O(\max(k, \log n)/k)$ with high probability [CR09]. The coherence of an arbitrary $U_1$ is no
smaller than 1, and may be as large as $n/k$.
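
To see Theorem 2 at work in the lowest-coherence case, one can build a PSD matrix whose eigenvectors are the columns of a normalized Hadamard matrix, so that $\tau = 1$, and compare the observed error with $\lambda_{k+1}(A)(1 + n/(\epsilon\ell))$ for $\ell \ge 2\tau k \log(k/\delta)/(1-\epsilon)^2$. The spectrum, the parameter values, the seed, and the helper `hadamard` are assumptions of this sketch, and a single random draw illustrates the bound but does not verify its probability statement.

```python
import math
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n, k, delta, eps = 256, 4, 0.1, 0.5
Q = hadamard(n) / math.sqrt(n)                  # orthogonal, entries +-1/sqrt(n)
lam = np.concatenate([np.ones(k), 1e-3 * np.ones(n - k)])
A = (Q * lam) @ Q.T                             # U_1 = Q[:, :k], lambda_{k+1} = 1e-3

tau = (n / k) * np.max(np.sum(Q[:, :k] ** 2, axis=1))      # coherence of U_1
ell = math.ceil(2 * tau * k * math.log(k / delta) / (1 - eps) ** 2)

rng = np.random.default_rng(3)
idx = rng.choice(n, size=ell, replace=False)    # uniform, without replacement
C, W = A[:, idx], A[np.ix_(idx, idx)]
err = np.linalg.norm(A - C @ np.linalg.pinv(W) @ C.T, 2)
bound = lam[k] * (1 + n / (eps * ell))          # right-hand side of Theorem 2
```

Here `tau` is exactly 1 and `ell` is 119, a little under half of the 256 columns.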

Remark 2. Theorem 2, like the main result of [TR10], promises exact recovery when $A$ is
exactly rank $k$ and has small coherence, with a sample of $O(k \log k)$ columns. Unlike the result
in [TR10], Theorem 2 is applicable in the case that $A$ is full-rank but has a sufficiently rapidly
decaying spectrum.

Remark 3. We might ask where attempts at sharpening the analysis of the naive Nystrom extension
should be aimed: toward more refined linear algebra bounds on column selection (Proposition 1),
or toward a deeper analysis of the randomness (Theorem 2)?

It is known that, for uniform sampling, the quantity $\|\Omega_1^\dagger\|_2^2$ remains
$\Omega(n/\ell)$ once $\ell \ge \tau k \log k$ [Rud99]. Likewise, in the regime of low $\tau$ and
$k \ll n$, $\Omega_2$ is likely to be an almost isometric embedding. This follows from the fact
that the columns of $\Omega_2$ contain $n - k$ of the $n$ entries of the corresponding columns of
$[U_1 \; U_2]^T$, so they are likely to be almost orthogonal and linearly independent. Together,
these observations suggest that $\|\Omega_2 \Omega_1^\dagger\|_2^2$ remains $\Omega(n/\ell)$ also.
Thus, we expect that no bounds much sharper than Theorem 2 on the error of the naive Nystrom
extension can be derived using the algebraic results in Proposition 1 as the starting point.

To obtain Theorem 2 we use Theorem 1 in conjunction with a bound on $\|\Omega_1^\dagger\|_2^2$
provided by the following lemma.

Lemma 1. Let $U$ be an $n \times k$ matrix with orthonormal columns. Take $\tau$ to be the
coherence of $U$,

\[ \tau = \mu_0(U). \]

Select $\epsilon \in (0, 1)$ and a nonzero failure probability $\delta$. Let $S$ be a random matrix
distributed as the first $\ell$ columns of a uniformly random permutation matrix of size $n$, where

\[ \ell \ge \frac{2 \tau}{(1 - \epsilon)^2} k \log \frac{k}{\delta}. \]

Then with probability exceeding $1 - \delta$, the matrix $U^T S$ has full row rank and satisfies

\[ \|(U^T S)^\dagger\|_2^2 \le \frac{n}{\epsilon \ell}. \]


We now proceed with the proof of Theorem 2.

Proof of Theorem 2. Because we are using uniform sampling without replacement, the sampling
matrix $S$ is formed by taking the first $\ell$ columns of a uniformly sampled random permutation
matrix. By Lemma 1, with probability exceeding $1 - \delta$ the matrix $\Omega_1$ has full row
rank, so the bound in Theorem 1 is applicable, and $\|\Omega_1^\dagger\|_2^2 \le n/(\epsilon \ell)$.
From Theorem 1 we conclude that

\[ \|A - C W^\dagger C^T\|_2 \le \|\Sigma_2\|_2 \left(1 + \|\Omega_2\|_2^2 \|\Omega_1^\dagger\|_2^2\right) \le \lambda_{k+1}(A) \left(1 + \frac{n}{\epsilon \ell}\right) \]

with at least the same probability. To obtain the second inequality, we used the fact that
$\|\Omega_2\|_2 \le \|U_2\|_2 \|S\|_2 \le 1$, together with $\|\Sigma_2\|_2 = \lambda_{k+1}(A)$.

□

One potential source of difficulty in the proof of Lemma 1 is the fact that the columns are
sampled without replacement, which introduces dependencies among the entries of the sampling
matrix $S$. The following matrix Chernoff bound, a standard simplification of the lower Chernoff
bound developed in [Tro11, Theorem 2.2], allows us to gloss over these dependencies.

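Proposition 2. Let $\mathcal{X}$ be a finite set of PSD matrices with dimension $k$, and suppose
that

\[ \max_{X \in \mathcal{X}} \lambda_1(X) \le B. \]

Sample $\{X_1, \ldots, X_\ell\}$ uniformly at random from $\mathcal{X}$ without replacement. Compute

\[ \mu_{\min} = \ell \cdot \lambda_k(\mathbb{E} X_1). \]

Then

\[ \mathbb{P} \left\{ \lambda_k\Big(\sum\nolimits_i X_i\Big) \le \epsilon \mu_{\min} \right\} \le k \cdot e^{-(1 - \epsilon)^2 \mu_{\min} / (2B)} \quad \text{for } \epsilon \in [0, 1). \]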
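
The right-hand side of Proposition 2 is easy to evaluate directly. The sketch below plugs in the values $B = k\tau/n$ and $\mu_{\min} = \ell/n$ that arise in the proof of Lemma 1 and confirms that the sample size $\ell \ge 2\tau k \log(k/\delta)/(1-\epsilon)^2$ drives the tail below $\delta$. The function name `chernoff_lower_tail` and the parameter values are illustrative assumptions.

```python
import math

def chernoff_lower_tail(k, eps, mu_min, B):
    """Right-hand side of Proposition 2: k * exp(-(1 - eps)^2 * mu_min / (2 * B))."""
    return k * math.exp(-((1.0 - eps) ** 2) * mu_min / (2.0 * B))

# Values used in the proof of Lemma 1: B = k * tau / n and mu_min = l / n.
n, k, tau, eps, delta = 256, 4, 1.0, 0.5, 0.1
ell = math.ceil(2.0 * tau * k * math.log(k / delta) / (1.0 - eps) ** 2)
failure_prob = chernoff_lower_tail(k, eps, mu_min=ell / n, B=k * tau / n)
```

Note that $n$ cancels in the ratio $\mu_{\min}/B$, which is why the required sample size depends on $n$ only through the coherence.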

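Proof of Lemma 1. Note that $U^T S$ has full row rank if $\lambda_k(U^T S S^T U) > 0$. Furthermore,

\[ \|(U^T S)^\dagger\|_2^2 = \lambda_k^{-1}(U^T S S^T U). \]

Thus to obtain both conclusions of the lemma, it is sufficient to verify that

\[ \lambda_k(U^T S S^T U) \ge \epsilon \frac{\ell}{n} \]

when $\ell$ is as stated.

We apply Proposition 2 to bound the probability that this inequality is not satisfied. Let $u_i$
denote the $i$th column of $U^T$. Then

\[ U^T S S^T U = \sum_{i=1}^{\ell} X_i, \]

where the $X_i$ are chosen uniformly at random, without replacement, from the set
$\mathcal{X} = \{u_i u_i^T\}_{i=1,\ldots,n}$. Clearly

\[ B = \max_i \|u_i\|^2 = \frac{k}{n} \tau \quad \text{and} \quad \mu_{\min} = \ell \cdot \lambda_k(\mathbb{E} X_1) = \frac{\ell}{n} \lambda_k(U^T U) = \frac{\ell}{n}. \]

Proposition 2 yields

\[ \mathbb{P} \left\{ \lambda_k(U^T S S^T U) \le \epsilon \frac{\ell}{n} \right\} \le k \cdot e^{-(1 - \epsilon)^2 \ell / (2 k \tau)}. \]

We require enough samples that

\[ \lambda_k(U^T S S^T U) \ge \epsilon \frac{\ell}{n} \]

with probability greater than $1 - \delta$, so we set

\[ k \cdot e^{-(1 - \epsilon)^2 \ell / (2 k \tau)} \le \delta \]

and solve for $\ell$, finding

\[ \ell \ge \frac{2 \tau}{(1 - \epsilon)^2} k \log \frac{k}{\delta}. \]

Thus, for values of $\ell$ satisfying this inequality, we achieve the stated spectral error bound
and ensure that $U^T S$ has full row rank. □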